compute node
- North America > United States > Wisconsin > Dane County > Madison (0.05)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada (0.04)
DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation
To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent methods have replaced gradient averaging with robust aggregation. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and offer only limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can tolerate only a limited number of Byzantine failures. In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. DETOX operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness, with a per-iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX yields orders-of-magnitude improvements in accuracy and speed over many state-of-the-art Byzantine-resilient approaches.
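Since the abstract describes DETOX's two-step structure only at a high level, the following sketch illustrates the general idea in Python. It is our schematic, not the paper's implementation: the function names are hypothetical, coordinate-wise median stands in for an arbitrary robust aggregator, and the node grouping is simplified.

```python
import numpy as np

def majority_vote(copies):
    """Filtering step: each gradient is computed by r redundant nodes;
    with an honest majority in the group, the value reported by most
    nodes is the true gradient and Byzantine values are filtered out."""
    vals, counts = np.unique(np.round(copies, 12), axis=0, return_counts=True)
    return vals[np.argmax(counts)]

def coordinate_median(grads):
    """Stand-in robust aggregator; any state-of-the-art method
    (e.g., geometric median or Multi-Krum) could be plugged in here."""
    return np.median(np.stack(grads), axis=0)

def detox_aggregate(grouped_copies, num_hier_groups=4):
    """grouped_copies: list of (r, d) arrays, one per redundant group.
    Assumes there are at least num_hier_groups groups."""
    # Step 1 (filtering): reduce each redundant group to one candidate.
    filtered = [majority_vote(c) for c in grouped_copies]
    # Step 2 (hierarchical aggregation): robustly aggregate within
    # chunks of the filtered gradients, then across the chunk results.
    chunks = np.array_split(np.stack(filtered), num_hier_groups)
    partial = [coordinate_median(chunk) for chunk in chunks]
    return coordinate_median(partial)
```

The filtering step is cheap (a vote per group), which is why the expensive robust aggregator only ever sees a reduced, pre-cleaned set of gradients.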
Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce -- an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large on preemptible compute nodes.
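To make the averaging scheme concrete, here is a single-process simulation of the grid-averaging idea behind Moshpit All-Reduce. This is our toy sketch under idealized assumptions (a full, static grid, exact group averaging, no failures); the actual protocol handles dynamic, unreliable peers and converges exponentially rather than exactly.

```python
import numpy as np

def moshpit_round(params, grid_shape, axis):
    """One schematic averaging round: peers are arranged on a virtual
    grid and each group along `axis` replaces its members' parameters
    with the group average."""
    grid = params.reshape(*grid_shape, -1)
    mean = grid.mean(axis=axis, keepdims=True)
    return np.broadcast_to(mean, grid.shape).reshape(params.shape).copy()

# 16 peers on a 4x4 grid, each holding a 3-dimensional parameter vector.
rng = np.random.default_rng(0)
params = rng.normal(size=(16, 3))
global_avg = params.mean(axis=0)
for axis in (0, 1):          # alternate grid axes across iterations
    params = moshpit_round(params, (4, 4), axis)
print(np.allclose(params, global_avg))  # True: exact in the ideal case
```

In this idealized full-grid setting the global average is reached after one pass over the grid axes; the point of the real protocol is that the same group-then-regroup averaging still converges when peers join, leave, or fail between rounds.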
- Asia > Japan (0.14)
- Europe > Finland > Uusimaa > Helsinki (0.06)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.05)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada (0.04)
ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
Sun, Yongqian, Pan, Xijie, Xiong, Xiao, Tao, Lei, Wang, Jiaju, Zhang, Shenglin, Yuan, Yuan, Li, Yuqi, Jian, Kunlin
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and insufficient accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: a failure graph is constructed from the output of a state classifier, and a customized random walk on the graph then localizes the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show that ClusterRCA achieves high accuracy in diagnosing network failures for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
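The graph-based step can be illustrated with a generic random walk with restart over a failure graph. The sketch below is our approximation of the general technique, not ClusterRCA's method: the paper's walk is customized, and its edge weights come from the state classifier's output rather than a hand-written adjacency matrix.

```python
import numpy as np

def random_walk_scores(adj, restart=0.15, iters=100):
    """Random walk with restart; adj[i, j] is the weight of the edge
    from node i to node j (edges point toward suspected culprits)."""
    n = adj.shape[0]
    # Row-normalize to transition probabilities; dangling rows -> uniform.
    row_sums = adj.sum(axis=1, keepdims=True)
    trans = np.where(row_sums > 0, adj / np.maximum(row_sums, 1e-12), 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = restart / n + (1 - restart) * scores @ trans
    return scores  # higher score = more likely root cause

# Toy failure graph over 4 nodes.
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], float)
print(np.argsort(-random_walk_scores(adj)))  # nodes ranked by suspicion
```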
- North America > United States (0.14)
- Europe > United Kingdom (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Energy (0.47)
- Telecommunications (0.47)
- Information Technology (0.46)
GPU-centric Communication Schemes for HPC and ML Applications
Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is the key technique for effectively utilizing these systems to execute scalable simulation and deep learning workloads, and the inter-process communication arising from the distributed execution of these parallel workloads is one of the main performance bottlenecks. Most programming models and runtime systems that serve the communication requirements of these systems support GPU-aware communication schemes, which move GPU-attached communication buffers directly from the GPU to the NIC without staging through host memory. Even with such GPU-awareness, however, a CPU thread is still required to orchestrate the communication operations. This survey discusses the available GPU-centric communication schemes, which move the control path of communication operations from the CPU to the GPU. It presents the need for these new schemes, the GPU and NIC capabilities required to implement them, and the potential use cases they address. Based on these discussions, it examines the challenges involved in supporting the presented GPU-centric communication schemes.
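The distinction between host-staged and GPU-aware communication can be made concrete with mpi4py and CuPy. This is an illustrative sketch, assuming a CUDA-aware MPI build (otherwise only the staged fallback works); GPU-centric schemes of the kind this survey covers would go further and initiate such operations from GPU code, removing the orchestrating CPU thread.

```python
# Run under MPI, e.g.: mpirun -n 4 python this_script.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
grad = cp.ones(1 << 20, dtype=cp.float32)  # GPU-resident buffer
out = cp.empty_like(grad)

# GPU-aware path: the device pointer is handed to MPI directly; with
# GPUDirect RDMA the NIC can read device memory without host staging.
# Note the CPU thread still drives the MPI call itself.
cp.cuda.get_current_stream().synchronize()
comm.Allreduce(grad, out, op=MPI.SUM)

# Non-GPU-aware fallback: stage through host memory (two extra copies).
host_in = cp.asnumpy(grad)
host_out = host_in.copy()
comm.Allreduce(host_in, host_out, op=MPI.SUM)
out_staged = cp.asarray(host_out)
```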
- North America > United States > Minnesota (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- (2 more...)
- Research Report (0.50)
- Overview (0.34)
- Information Technology > Hardware (1.00)
- Information Technology > Graphics (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
The Artificial Scientist -- in-transit Machine Learning of Plasma Simulations
Kelling, Jeffrey, Bolea, Vicente, Bussmann, Michael, Checkervarty, Ankush, Debus, Alexander, Ebert, Jan, Eisenhauer, Greg, Gutta, Vineeth, Kesselheim, Stefan, Klasky, Scott, Pausch, Richard, Podhorszki, Norbert, Poschel, Franz, Rogers, David, Rustamov, Jeyhun, Schmerler, Steve, Schramm, Ulrich, Steiniger, Klaus, Widera, Rene, Willmann, Anna, Chandrasekaran, Sunita
Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive I/O and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these large volumes of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file-system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application's output routines. As a proof of concept, we consider a GPU-accelerated particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting when learning from this non-steady process in a continual manner. We detail the challenges addressed while porting and scaling to the Frontier exascale system.
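Experience replay, named here as the continual-learning mechanism, can be sketched in a few lines. The buffer and loop below are our illustration, not the paper's in-transit pipeline: each batch mixes freshly streamed samples with replayed history so the model does not forget earlier phases of the non-steady process.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay sketch for continual learning on a
    streamed, non-stationary source."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest samples evicted first

    def add_many(self, samples):
        self.buf.extend(samples)

    def mixed_batch(self, fresh, replay_fraction=0.5):
        """Combine freshly streamed samples with replayed history."""
        k = min(len(self.buf), int(len(fresh) * replay_fraction))
        return list(fresh) + random.sample(list(self.buf), k)

# Toy stand-in for data arriving in transit from a running simulation.
buffer = ReplayBuffer()
for step in range(100):
    fresh = [(step, random.random()) for _ in range(32)]  # new time slice
    batch = buffer.mixed_batch(fresh)  # train on new + replayed samples here
    buffer.add_many(fresh)
```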
- North America > United States > Delaware > New Castle County > Newark (0.14)
- Europe > Germany > Saxony > Dresden (0.05)
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
- (2 more...)
- Energy (0.93)
- Government > Regional Government (0.46)